Add non-power-of-2 shapes for Morton coding to benchmarks #3717
d-v-b merged 5 commits into zarr-developers:main
Add (30,30,30) to large_morton_shards and (10,10,10), (20,20,20), (30,30,30) to morton_iter_shapes to benchmark the scalar fallback path for non-power-of-2 shapes, which are not fully covered by the vectorized hypercube path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents the performance penalty when a shard shape is just above a power-of-2 boundary, causing n_z to jump from 32,768 to 262,144. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Benchmark Results
These benchmarks were run on this branch (which includes the vectorized

| Shape | Elements | Type | Mean time |
|---|---|---|---|
| (8,8,8) | 512 | power-of-2 | 0.45 ms |
| (16,16,16) | 4,096 | power-of-2 | 3.6 ms |
| (32,32,32) | 32,768 | power-of-2 | 28.9 ms |
| (10,10,10) | 1,000 | non-power-of-2 | 9.6 ms |
| (20,20,20) | 8,000 | non-power-of-2 | 88.2 ms |
| (30,30,30) | 27,000 | non-power-of-2 | 125.6 ms |
| (33,33,33) | 35,937 | near-miss (+1 above 32³) | 767 ms |

The near-miss penalty is striking: (33,33,33) has only ~10% more elements than (32,32,32) but takes 27× longer. This is because the current floor-hypercube approach must scalar-decode many Morton codes beyond the guaranteed in-bounds region.
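The arithmetic behind the near-miss penalty can be reproduced with a small sketch. It assumes that `n_z` is the product of each dimension rounded up to the next power of two and that the floor hypercube rounds each dimension down. The helper names (`next_pow2`, `prev_pow2`, `morton_counts`) are hypothetical and are not taken from the zarr codebase.

```python
from math import prod


def next_pow2(n: int) -> int:
    """Smallest power of two >= n (for n >= 1)."""
    return 1 << (n - 1).bit_length()


def prev_pow2(n: int) -> int:
    """Largest power of two <= n (for n >= 1)."""
    return 1 << (n.bit_length() - 1)


def morton_counts(shape: tuple[int, ...]) -> tuple[int, int, int]:
    """Return (n_total, n_z, n_floor) for a shard shape.

    n_total: number of valid coordinates in the shape
    n_z:     size of the enclosing power-of-2 hypercube (assumed definition)
    n_floor: size of the largest power-of-2 hypercube inside the shape
             (assumed definition)
    """
    n_total = prod(shape)
    n_z = prod(next_pow2(c) for c in shape)
    n_floor = prod(prev_pow2(c) for c in shape)
    return n_total, n_z, n_floor


for shape in [(30, 30, 30), (32, 32, 32), (33, 33, 33)]:
    n_total, n_z, n_floor = morton_counts(shape)
    print(shape, n_total, n_z, n_floor, f"ratio={n_z / n_total:.2f}")
```

Under these assumptions, (33,33,33) gives `n_z = 262,144` against only 35,937 valid coordinates, so most generated codes fall outside the shape and have to be discarded.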
test_sharded_morton_write_single_chunk — write 1 chunk to a large shard, cache cleared each round
| Shape | Chunks/shard | Mean time |
|---|---|---|
| (32,32,32) | 32,768 | 35.7 ms |
| (30,30,30) | 27,000 | 127.5 ms |
| (33,33,33) | 35,937 | 767.8 ms |

test_sharded_morton_single_chunk — read 1 chunk from a large shard (cached after first access)
| Shape | Mean time |
|---|---|
| (32,32,32) | 0.73 ms |
| (30,30,30) | 0.69 ms |
| (33,33,33) | 0.71 ms |

Reads are fast across all shapes once the Morton order cache is warm (the first call pays the penalty, subsequent reads are cached).
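The warm-cache behaviour can be illustrated with a minimal sketch using `functools.lru_cache`. This is an assumption about the general caching pattern, not a copy of the library's code: `cached_order` is a hypothetical stand-in, and it returns a plain row-major enumeration rather than a real Morton order.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def cached_order(shape: tuple[int, ...]) -> tuple[tuple[int, ...], ...]:
    """Hypothetical stand-in for an expensive per-shape ordering.

    Row-major enumeration is used here only to keep the sketch short;
    the point is that the result is computed once per shape.
    """
    return tuple(product(*(range(c) for c in shape)))


first = cached_order((4, 4, 4))   # cold call: pays the full computation cost
second = cached_order((4, 4, 4))  # warm call: returned from the cache
print(first is second, cached_order.cache_info())

# A benchmark that wants to measure the cold path every round
# has to clear the cache between rounds.
cached_order.cache_clear()
```

This is why the read benchmark shows nearly identical times for all three shapes, while the write benchmark, which clears the cache each round, exposes the ordering cost.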
Interpretation
The benchmarks confirm that non-power-of-2 shard shapes carry a significant Morton computation penalty under the current implementation, with near-miss shapes (like (33,33,33)) being especially slow. These benchmarks provide a baseline to measure improvements from follow-on optimization work.
…ort strategy (#3718)

* tests: Add non-power-of-2 shard shapes to benchmarks

  Add (30,30,30) to large_morton_shards and (10,10,10), (20,20,20), (30,30,30) to morton_iter_shapes to benchmark the scalar fallback path for non-power-of-2 shapes, which are not fully covered by the vectorized hypercube path.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* tests: Add near-miss power-of-2 shape (33,33,33) to benchmarks

  Documents the performance penalty when a shard shape is just above a power-of-2 boundary, causing n_z to jump from 32,768 to 262,144.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* style: Apply ruff format to benchmark file

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* changes: Add changelog entry for PR #3717

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* perf: Fix near-miss penalty in _morton_order with hybrid ceiling+argsort strategy

  For shapes just above a power-of-2 (e.g. (33,33,33)), the ceiling-only approach generates n_z=262,144 Morton codes for only 35,937 valid coordinates (7.3× overgeneration). The floor+scalar approach is even worse since the scalar loop iterates n_z-n_floor times (229,376 for (33,33,33)), not n_total-n_floor.

  The fix: when n_z > 4*n_total, use an argsort strategy that enumerates all n_total valid coordinates via meshgrid, encodes each to a Morton code using vectorized bit manipulation, then sorts by Morton code. This avoids the large overgeneration while remaining fully vectorized.

  Result for test_morton_order_iter:
    (30,30,30): 24ms (ceiling, ratio=1.21)
    (32,32,32): 28ms (ceiling, ratio=1.00)
    (33,33,33): 32ms (argsort, ratio=7.3 → fixed from ~820ms with scalar)

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: Address pre-commit CI failures in _morton_order

  - Replace Unicode multiplication sign × with ASCII x in comment (RUF003)
  - Add explicit type annotation for np.argsort result to satisfy mypy

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: Cast argsort result via np.asarray to resolve mypy no-any-return

  np.stack returns Any in mypy's view, so indexing into it also returns Any. Using np.asarray(..., dtype=np.intp) makes the type explicit and avoids the no-any-return error at the return site.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

* fix: Pre-declare order type to resolve mypy no-any-return in _morton_order

  np.asarray and np.stack return Any with numpy 2.1 type stubs, causing mypy to infer the return type as Any. Pre-declaring order as npt.NDArray[np.intp] before the if/else makes the intended type explicit.

  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Davis Bennett <davis.v.bennett@gmail.com>
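The hybrid strategy described in the #3718 commit message can be sketched in plain Python. This is an illustration under stated assumptions, not the merged implementation: the commit describes a vectorized NumPy version (meshgrid plus argsort), whereas this sketch uses `itertools.product` and `sorted` to stay dependency-free. The function names are hypothetical, and the bit-interleaving convention (dimension 0 receives the lowest bit, dimensions with fewer bits drop out early) is assumed.

```python
from itertools import product
from math import prod


def dim_bits(shape: tuple[int, ...]) -> tuple[int, ...]:
    """Bits needed to address each dimension (ceil(log2(c)))."""
    return tuple((c - 1).bit_length() for c in shape)


def choose_strategy(shape: tuple[int, ...]) -> str:
    """Apply the n_z > 4 * n_total threshold from the commit message."""
    n_total = prod(shape)
    n_z = 1 << sum(dim_bits(shape))
    return "argsort" if n_z > 4 * n_total else "ceiling"


def encode_morton(coord: tuple[int, ...], bits: tuple[int, ...]) -> int:
    """Interleave coordinate bits into one Morton code (assumed bit order)."""
    code = 0
    out_bit = 0
    for coord_bit in range(max(bits)):
        for dim, nbits in enumerate(bits):
            if coord_bit < nbits:
                code |= ((coord[dim] >> coord_bit) & 1) << out_bit
                out_bit += 1
    return code


def morton_order_by_sort(shape: tuple[int, ...]) -> list[tuple[int, ...]]:
    """Enumerate only the valid coordinates, then sort them by Morton code."""
    bits = dim_bits(shape)
    coords = product(*(range(c) for c in shape))
    return sorted(coords, key=lambda c: encode_morton(c, bits))


for shape in [(30, 30, 30), (32, 32, 32), (33, 33, 33)]:
    print(shape, choose_strategy(shape))
print(morton_order_by_sort((2, 2)))
```

Because the sort touches only `n_total` coordinates, its cost no longer depends on how far the shape overshoots the previous power of two, which is what removes the near-miss penalty.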